WBS of Q1

Setup

Environment Setup

Function Setup

Function: read_csv_from_zip

Reads a CSV file from inside a specified ZIP archive, keeping only the specified columns.

Usage: read_csv_from_zip(zip_filepath, csv_filename, columns_to_keep)
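The body of read_csv_from_zip is not shown in this report; a minimal sketch consistent with the usage above (reading the CSV straight from the archive via unz() and subsetting to columns_to_keep) could be:

```r
# Sketch of read_csv_from_zip (assumed implementation; the actual helper
# used in this report may differ in its details).
# Reads csv_filename directly from zip_filepath without extracting the
# archive to disk, then keeps only the requested columns.
read_csv_from_zip <- function(zip_filepath, csv_filename, columns_to_keep) {
  con <- unz(zip_filepath, csv_filename)   # connection into the archive
  data <- read.csv(con, stringsAsFactors = FALSE)
  missing_cols <- setdiff(columns_to_keep, names(data))
  if (length(missing_cols) > 0) {
    stop("Columns not found in CSV: ", paste(missing_cols, collapse = ", "))
  }
  data[, columns_to_keep, drop = FALSE]
}
```

Reading through unz() avoids unpacking the whole archive to disk; the real helper may differ, e.g. in error handling or in using readr instead of base read.csv.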

Function: split_data

Splits a dataset into training and test sets based on a specified training ratio.

Usage: split_data(data, train_ratio = 0.8)

# Set a seed for reproducibility
set.seed(62380486)
# Function to split data into training and test sets
split_data <- function(data, train_ratio = 0.8) {
  # Validate that train_ratio lies strictly between 0 and 1
  if (train_ratio <= 0 || train_ratio >= 1) {
    stop("Error: train_ratio must be between 0 and 1 (exclusive).")
  }
  # Randomly select the specified percentage of indices for the training set
  train_ind <- sample(seq_len(nrow(data)), 
                      size = floor(train_ratio * nrow(data)),
                      replace = FALSE)
  
  # Use the remaining indices for the test set
  test_ind <- setdiff(seq_len(nrow(data)), train_ind)
  
  # Create training data using the selected indices
  train_data <- data[train_ind, , drop = FALSE]
  rownames(train_data) <- NULL

  # Create test data using the remaining indices
  test_data <- data[test_ind, , drop = FALSE]
  rownames(test_data) <- NULL
  
  # Return both training and test data as a list
  return(list(train = train_data, test = test_data))
}

Data Preparation

Load Data and EDA Check

# Load the data with the common helper read_csv_from_zip
xdata <- read_csv_from_zip(zip_filepath = "./data/data_assignment_2.zip",
                          csv_filename = "heart.csv",
                          columns_to_keep = c("DEATH", "GLUCOSE", "SYSBP")
                          )

skim(xdata)
Data summary
Name xdata
Number of rows 11627
Number of columns 3
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
DEATH 0 1.00 0.30 0.46 0.0 0 0 1 1 ▇▁▁▁▃
GLUCOSE 1440 0.88 84.12 24.99 39.0 72 80 89 478 ▇▁▁▁▁
SYSBP 0 1.00 136.32 22.80 83.5 120 132 149 295 ▆▇▁▁▁

Missing Data Imputing

Here we notice that GLUCOSE has 1440 missing values, accounting for 12.38% of all rows. We adopt a median-imputation strategy after weighing the drawbacks of the alternatives below.

  • Complete Case Deletion: The 12.38% missing data proportion is too high, risking significant information loss and bias.

  • Regression/Classification Imputation: Using other variables (like SYSBP and DEATH) to predict GLUCOSE introduces data leakage when DEATH is the target variable.

  • Missing Values as a Separate Category: This approach is unsuitable for numerical features like GLUCOSE as it can distort the variable’s distribution and introduce bias.

# Calculate the median of the GLUCOSE variable, ignoring NA values
glucose_median <- median(xdata$GLUCOSE, na.rm = TRUE)

# Impute the missing values in GLUCOSE with the calculated median
xdata$GLUCOSE[is.na(xdata$GLUCOSE)] <- glucose_median

# Check missing data with skim again; this time GLUCOSE's missing count should be 0
skim(xdata)
Data summary
Name xdata
Number of rows 11627
Number of columns 3
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
DEATH 0 1 0.30 0.46 0.0 0 0 1 1 ▇▁▁▁▃
GLUCOSE 0 1 83.61 23.43 39.0 73 80 87 478 ▇▁▁▁▁
SYSBP 0 1 136.32 22.80 83.5 120 132 149 295 ▆▇▁▁▁

Data Splitting

Q(a). Split the dataset into a training set (80% of entries) and a test set (20% of entries).

sp_data <- split_data(data=xdata, train_ratio = 0.8)
sp_data
## $train
## # A tibble: 9,301 × 3
##    DEATH GLUCOSE SYSBP
##    <dbl>   <dbl> <dbl>
##  1     0      81  137 
##  2     0      76  125 
##  3     0     101  147 
##  4     1      87  120 
##  5     0      71  154.
##  6     1      64  138 
##  7     1      80  145 
##  8     0      78  116 
##  9     0      76   99 
## 10     0      85  131 
## # ℹ 9,291 more rows
## 
## $test
## # A tibble: 2,326 × 3
##    DEATH GLUCOSE SYSBP
##    <dbl>   <dbl> <dbl>
##  1     1     103  150 
##  2     0      85  138 
##  3     0      98  149 
##  4     0      80  110.
##  5     0      79  142.
##  6     0      83  120 
##  7     1      70  140 
##  8     0      72  138 
##  9     0      80  140.
## 10     0      91  144.
## # ℹ 2,316 more rows

Data Visualisation

Q(b). Visualise the relationship between DEATH, GLUCOSE and SYSBP in a suitable way. Form an initial hypothesis of what to look for when doing the classification.

library(plotly)

fig <- plot_ly(xdata, x = ~GLUCOSE, y = ~SYSBP, z = ~DEATH,
               color = ~factor(DEATH),  # Convert DEATH to a factor for coloring
               colors = c("blue", "red"),
               marker = list(size = 2),
               symbol = ~factor(DEATH),
               alpha = 0.45,
               type = "scatter3d",
               mode = "markers",
              # Add mouse hover text
               text = ~paste("GLUCOSE:", GLUCOSE, "<br>SYSBP:", SYSBP, "<br>DEATH:", DEATH)
               ) 


fig <- fig %>% layout(
  scene = list(
    xaxis = list(title = "GLUCOSE"),
    yaxis = list(title = "SYSBP"),
    zaxis = list(title = "DEATH")
  ))

fig  # show figure

Binary Logistic Regression Model

Hypothesis Formation

We will use a function of GLUCOSE and SYSBP to estimate the probability of DEATH being “1” (the probability of “0” follows immediately, since the outcome is binary). This can be written as \[\Pr(G = 1 \mid X)\]

where \(X\) is a vector, containing features \(x_1: GLUCOSE, x_2: SYSBP\); \(G\) represents the output binary variable DEATH.

To make sure the function will always return the result between 0 and 1, we can use the following sigmoid function to estimate the probability of being class 1:

\[g_1(x) = \mathbb P(G = 1 \mid X=x) = \frac{\exp(b_0 + b_1 x_1 + b_2 x_2)}{1 + \exp(b_0 + b_1 x_1 + b_2 x_2)} \] so the probability of being class 0 would be: \[ g_0(x) = \mathbb P(G = 0 \mid X=x) = \frac{1}{1 + \exp(b_0 + b_1 x_1 + b_2 x_2)} \]

The problem now is to find the optimal combination of \(b_0\), \(b_1\) and \(b_2\) from the training set.
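To make the formulas concrete, the sketch below evaluates \(g_1(x)\) and \(g_0(x)\) for hypothetical coefficient values (the actual \(b_0, b_1, b_2\) are estimated from the training set with glm() in the next section); the two probabilities always lie in \((0, 1)\) and sum to one.

```r
# Sketch of the class-probability formulas above, using hypothetical
# coefficients b0, b1, b2 chosen only for illustration.
sigmoid_probs <- function(b0, b1, b2, glucose, sysbp) {
  eta <- b0 + b1 * glucose + b2 * sysbp   # linear predictor b0 + b1*x1 + b2*x2
  g1 <- exp(eta) / (1 + exp(eta))         # P(G = 1 | X = x)
  g0 <- 1 / (1 + exp(eta))                # P(G = 0 | X = x)
  c(g1 = g1, g0 = g0)
}

# Example evaluation at GLUCOSE = 80, SYSBP = 130 with made-up coefficients
p <- sigmoid_probs(b0 = -4.8, b1 = 0.008, b2 = 0.024, glucose = 80, sysbp = 130)
```

Whatever the coefficients, p["g1"] + p["g0"] equals 1 by construction, which is exactly the property that makes the sigmoid suitable for modelling a binary outcome.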

Fit Logistic Regression Model

Q(c). On the training set, fit a (multiple) logistic regression model.

N.B. In this question, you are allowed to use glm.

logreg.fit <- glm(
    formula = DEATH ~ GLUCOSE + SYSBP,
    family = binomial,
    data = sp_data$train, 
    na.action = na.omit,
    model = TRUE,
    method = "glm.fit",
    x = FALSE,
    y = TRUE,
    contrasts = NULL
)
summary(logreg.fit)
## 
## Call:
## glm(formula = DEATH ~ GLUCOSE + SYSBP, family = binomial, data = sp_data$train, 
##     na.action = na.omit, model = TRUE, method = "glm.fit", x = FALSE, 
##     y = TRUE, contrasts = NULL)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.832113   0.167240 -28.893  < 2e-16 ***
## GLUCOSE      0.007911   0.001067   7.413 1.23e-13 ***
## SYSBP        0.024086   0.001047  23.008  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11408  on 9300  degrees of freedom
## Residual deviance: 10722  on 9298  degrees of freedom
## AIC: 10728
## 
## Number of Fisher Scoring iterations: 4

Apply Trained Model to Test Dataset

# 1. Predict probabilities (using the test set)
predicted_probabilities <- predict(logreg.fit, newdata = sp_data$test, type = "response")

# 2. Set the threshold
threshold <- 0.5

# 3. Convert to binary classification
predicted_classes <- ifelse(predicted_probabilities > threshold, 1, 0)

# 4. Calculate the confusion matrix (using the test set)
confusion_matrix <- table(Actual = sp_data$test$DEATH, Predicted = predicted_classes)

# 5. Calculate the misclassification rate
misclassification_rate <- (confusion_matrix[1, 2] + confusion_matrix[2, 1]) / sum(confusion_matrix)

Q(c).i Compute the misclassification rates on the test set

misclassification_rate
## [1] 0.2919175

Q(c).ii Compute the confusion matrix on the test set

confusion_matrix
##       Predicted
## Actual    0    1
##      0 1533   83
##      1  596  114
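The counts above can be turned into standard summary rates; a quick sketch using the printed matrix (TN = 1533, FP = 83, FN = 596, TP = 114):

```r
# Derived rates from the confusion matrix above (rows = actual, cols = predicted);
# the counts are copied from the printed output.
cm <- matrix(c(1533, 596, 83, 114), nrow = 2,
             dimnames = list(Actual = c(0, 1), Predicted = c(0, 1)))
accuracy    <- sum(diag(cm)) / sum(cm)        # (TN + TP) / total
sensitivity <- cm["1", "1"] / sum(cm["1", ])  # recall for DEATH = 1
specificity <- cm["0", "0"] / sum(cm["0", ])  # recall for DEATH = 0
```

The accuracy (about 0.708) matches 1 minus the misclassification rate reported above, while the low sensitivity (about 0.16) shows that, at the 0.5 threshold, the model misses most DEATH = 1 cases.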

Q(c).iii Visualise your fitted classification model in a suitable way

library(plotly)

# 1. Create a grid covering the ranges of GLUCOSE and SYSBP
glucose_range <- seq(min(sp_data$train$GLUCOSE, na.rm = TRUE), max(sp_data$train$GLUCOSE, na.rm = TRUE), length.out = 50)
sysbp_range <- seq(min(sp_data$train$SYSBP, na.rm = TRUE), max(sp_data$train$SYSBP, na.rm = TRUE), length.out = 50)
grid <- expand.grid(GLUCOSE = glucose_range, SYSBP = sysbp_range)

# 2. Predict probabilities on the grid with the fitted model
grid$predicted_probability <- predict(logreg.fit, newdata = grid, type = "response")

# 3. Reshape the predicted probabilities into a matrix
probability_matrix <- matrix(grid$predicted_probability, nrow = length(glucose_range), ncol = length(sysbp_range))

# 4. Create a 2D contour plot
fig <- plot_ly(
  x = glucose_range,
  y = sysbp_range,
  z = probability_matrix,
  type = "contour",
  contours = list(
    showlabels = TRUE,
    labelfont = list(
      size = 12,
      color = "black"
    ),
    start = 0,
    end = 1,
    size = 0.1
  )
) %>%
  layout(
    title = "2D Decision Boundary with Test Data",
    xaxis = list(title = "GLUCOSE"),
    yaxis = list(title = "SYSBP")
  )

# 5. Add the test-set points, using DEATH for colour and hover text
fig <- fig %>% add_trace(
  data = sp_data$test,
  x = ~GLUCOSE,
  y = ~SYSBP,
  type = "scatter",
  mode = "markers",
  marker = list(
    size = 5,
    color = ifelse(sp_data$test$DEATH == 1, "red", "blue"),
    opacity = 0.8
  ),
  text = ~ifelse(DEATH == 1, "DEATH=1", "DEATH=0"),  # Set the hover text
  hoverinfo = "text",  # Show only the hover text
  name = "Test Data"
)

# 6. Add a legend (optional; position can be adjusted)
fig <- fig %>% layout(
  showlegend = TRUE,
  legend = list(
    x = 0.85,  # x coordinate of the legend (0-1)
    y = 0.85   # y coordinate of the legend (0-1)
  )
)

# 7. Display the figure
fig

Q(c).iv Make a comment or observation regarding the goodness of fit
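One simple way to quantify goodness of fit is McFadden's pseudo-\(R^2\), computed here as a sketch from the null and residual deviances printed in summary(logreg.fit) above:

```r
# McFadden's pseudo-R^2 from the deviances in the model summary above;
# the two values below are copied from the printed glm output.
null_deviance     <- 11408
residual_deviance <- 10722
mcfadden_r2 <- 1 - residual_deviance / null_deviance
```

The resulting value is about 0.06, suggesting GLUCOSE and SYSBP explain only a modest share of the variation in DEATH, which is consistent with the fairly high misclassification rate on the test set.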